48 research outputs found

    In-Vivo Bytecode Instrumentation for Improving Privacy on Android Smartphones in Uncertain Environments

    Get PDF
    In this paper we claim that an efficient and readily applicable means to improve privacy of Android applications is: 1) to perform runtime monitoring by instrumenting the application bytecode and 2) in-vivo, i.e. directly on the smartphone. We present a tool chain to do this and present experimental results showing that this tool chain can run on smartphones in a reasonable amount of time and with a realistic effort. Our findings also identify challenges to be addressed before running powerful runtime monitoring and instrumentations directly on smartphones. We implemented two use-cases leveraging the tool chain: BetterPermissions, a fine-grained user centric permission policy system and AdRemover an advertisement remover. Both prototypes improve the privacy of Android systems thanks to in-vivo bytecode instrumentation.Comment: ISBN: 978-2-87971-111-

    Challenges and Outlook in Machine Learning-based Malware Detection for Android

    Get PDF
    Just like in traditional desktop computing, one of the major security issues in mobile computing lies in malicious software. Several recent studies have shown that Android, as today’s most widespread Operating System, is the target of most of the new families of malware. Manually analysing an Android application to determine whether it is malicious or not is a time- consuming process. Furthermore, because of the complexity of analysing an application, this task can only be conducted by highly-skilled—hence hard to come by—professionals. Researchers naturally sought to transfer this process from humans to computers to lower the cost of detecting malware. Machine-Learning techniques, looking at patterns amongst known malware and inferring models of what discriminates malware from goodware, have long been summoned to build malware detectors. The vast quantity of data involved in malware detection, added to the fact that we do not know a priori how to express in technical terms the difference between malware and goodware, indeed makes the malware detection question a seemingly textbook example of a possible Machine- Learning application. Despite the vast amount of literature published on the topic of detecting malware with machine- learning, malware detection is not a solved problem. In this Thesis, we investigate issues that affect performance evaluation and that thus may render current machine learning-based mal- ware detectors for Android hardly usable in practical settings, and we propose an approach to overcome those issues. While the experiments presented in this thesis all rely on feature-sets obtained through lightweight static analysis, several of our findings could apply equally to all Machine Learning-based malware detection approaches. In the first part of this thesis, background information on machine-learning and on malware detection is provided, and the related work is described. A snapshot of the malware landscape in Android application markets is then presented. The second part discusses three pitfalls hindering the evaluation of malware detectors. We show with extensive experiments how validation methodology, History-unaware dataset construction and the choice of a ground truth can heavily interfere with the performance results of malware detectors. In a third part, we present an practical approach to detect Android Malware in real-world settings. We then propose several research paths to get closer to our long term goal of building practical, dependable and predictable Android Malware detectors

    A First Look at Android Applications in Google Play related to Covid-19

    Get PDF
    Due to the convenience of access-on-demand to information and business solutions, mobile apps have become an important asset in the digital world. In the context of the Covid-19 pandemic, app developers have joined the response effort in various ways by releasing apps that target different user bases (e.g., all citizens or journalists), offer different services (e.g., location tracking or diagnostic-aid), provide generic or specialized information, etc. While many apps have raised some concerns by spreading misinformation or even malware, the literature does not yet provide a clear landscape of the different apps that were developed. In this study, we focus on the Android ecosystem and investigate Covid-related Android apps. In a best-effort scenario, we attempt to systematically identify all relevant apps and study their characteristics with the objective to provide a First taxonomy of Covid-related apps, broadening the relevance beyond the implementation of contact tracing. Overall, our study yields a number of empirical insights that contribute to enlarge the knowledge on Covid-related apps: (1) Developer communities contributed rapidly to the Covid-19, with dedicated apps released as early as January 2020; (2) Covid-related apps deliver digital tools to users (e.g., health diaries), serve to broadcast information to users (e.g., spread statistics), and collect data from users (e.g., for tracing); (3) Covid-related apps are less complex than standard apps; (4) they generally do not seem to leak sensitive data; (5) in the majority of cases, Covid-related apps are released by entities with past experience on the market, mostly official government entities or public health organizations.Comment: Accepted in Empirical Software Engineering under reference: EMSE-D-20-00211R

    LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

    Full text link
    Transformer-based models, such as BERT, have revolutionized various language tasks, but still struggle with large file classification due to their input limit (e.g., 512 tokens). Despite several attempts to alleviate this limitation, no method consistently excels across all benchmark datasets, primarily because they can only extract partial essential information from the input file. Additionally, they fail to adapt to the varied properties of different types of large files. In this work, we tackle this problem from the perspective of correlated multiple instance learning. The proposed approach, LaFiCMIL, serves as a versatile framework applicable to various large file classification tasks covering binary, multi-class, and multi-label classification tasks, spanning various domains including Natural Language Processing, Programming Language Processing, and Android Analysis. To evaluate its effectiveness, we employ eight benchmark datasets pertaining to Long Document Classification, Code Defect Detection, and Android Malware Detection. Leveraging BERT-family models as feature extractors, our experimental results demonstrate that LaFiCMIL achieves new state-of-the-art performance across all benchmark datasets. This is largely attributable to its capability of scaling BERT up to nearly 20K tokens, running on a single Tesla V-100 GPU with 32G of memory.Comment: 12 pages; update results; manuscript revisio

    A Pre-Trained BERT Model for Android Applications

    Full text link
    The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task

    A First Look at Android Applications in Google Play related to Covid-19

    Get PDF
    Due to the convenience of access-on-demand to information and business solutions, mobile apps have become an important asset in the digital world. In the context of the Covid-19 pandemic, app developers have joined the response effort in various ways by releasing apps that target different user bases (e.g., all citizens or journalists), offer different services (e.g., location tracking or diagnostic-aid), provide generic or specialized information, etc. While many apps have raised some concerns by spreading misinformation or even malware, the literature does not yet provide a clear landscape of the different apps that were developed. In this study, we focus on the Android ecosystem and investigate Covid-related Android apps. In a best-effort scenario, we attempt to systematically identify all relevant apps and study their characteristics with the objective to provide a First taxonomy of Covid related apps, broadening the relevance beyond the implementation of contact tracing. Overall, our study yields a number of empirical insights that contribute to enlarge the knowledge on Covid-related apps: (1) Developer communities contributed rapidly to the Covid-19, with dedicated apps released as early as January 2020; (2) Covid-related apps deliver digital tools to users (e.g., health diaries), serve to broadcast information to users (e.g., spread statistics), and collect data from users (e.g., for tracing); (3) Covid-related apps are less complex than standard apps; (4) they generally do not seem to leak sensitive data; (5) in the majority of cases, Covid-related apps are released by entities with past experience on the market, mostly official government entities or public health organizations

    Lessons Learnt on Reproducibility in Machine Learning Based Android Malware Detection

    Get PDF
    A well-known curse of computer security research is that it often produces systems that, while technically sound, fail operationally. To overcome this curse, the community generally seeks to assess proposed systems under a variety of settings in order to make explicit every potential bias. In this respect, recently, research achievements on machine learning based malware detection are being considered for thorough evaluation by the community. Such an effort of comprehensive evaluation supposes first and foremost the possibility to perform an independent reproduction study in order to sharpen evaluations presented by approaches’ authors. The question Can published approaches actually be reproduced? thus becomes paramount despite the little interest such mundane and practical aspects seem to attract in the malware detection field. In this paper, we attempt a complete reproduction of five Android Malware Detectors from the literature and discuss to what extent they are “reproducible”. Notably, we provide insights on the implications around the guesswork that may be required to finalise a working implementation. Finally, we discuss how barriers to reproduction could be lifted, and how the malware detection field would benefit from stronger reproducibility standards—like many various fields already have

    Android Malware Detection Using BERT

    Get PDF
    In this paper, we propose two empirical studies to (1) detect Android malware and (2) classify Android malware into families. We rst (1) reproduce the results of MalBERT using BERT models learning with Android application's manifests obtained from 265k applications (vs. 22k for MalBERT) from the AndroZoo dataset in order to detect malware. The results of the MalBERT paper are excellent and hard to believe as a manifest only roughly represents an application, we therefore try to answer the following questions in this paper. Are the experiments from MalBERT reproducible? How important are Permissions for mal- ware detection? Is it possible to keep or improve the results by reducing the size of the manifests? We then (2) investigate if BERT can be used to classify Android malware into families. The results show that BERT can successfully di erentiate malware/goodware with 97% accuracy. Further- more BERT can classify malware families with 93% accuracy. We also demonstrate that Android permissions are not what allows BERT to successfully classify and even that it does not actually need it

    Improving privacy on android smartphones through in-vivo bytecode instrumentation

    Get PDF
    In this paper we claim that a widely applicable and efficient means to fight against malicious mobile Android applications is: 1) to per- form runtime monitoring 2) by instrumenting the application byte- code and 3) in-vivo, i.e. directly on the smartphone. We present a tool chain to do this and present experimental results showing that this tool chain can run on smartphones in a reasonable amount of time and with a realistic effort. Our findings also identify chal- lenges to be addressed before running powerful runtime monitoring and instrumentations directly on smartphones. We implemented two use-cases leveraging the tool chain: FineGPolicy, a fine-grained user centric permission policy system and AdRemover an adver- tisement remover. Both prototypes improve the privacy of Android systems thanks to in-vivo bytecode instrumentation
    corecore